Skip to content

datalake: compile-time schema descriptors for table_definition#30170

Merged
andrwng merged 1 commit intoredpanda-data:devfrom
andrwng:compile-time-iceberg-structs
Apr 21, 2026
Merged

datalake: compile-time schema descriptors for table_definition#30170
andrwng merged 1 commit intoredpanda-data:devfrom
andrwng:compile-time-iceberg-structs

Conversation

@andrwng
Copy link
Copy Markdown
Contributor

@andrwng andrwng commented Apr 15, 2026

Replace the hand-rolled schemaless_struct_type() with a compile-time schema descriptor system. The schema is defined as a type:

  using schemaless_desc = struct_desc<
    field_desc<"redpanda", rp_desc>>;

From this single definition:

  • schemaless_struct_type() generates the runtime iceberg::struct_type
  • index_of<"field_name"> provides compile-time field indices
  • type_field/value_field accessors replace brittle fields[0]
  • rp_struct_type()/rp_struct_value() provide named access

So code like this:

std::unique_ptr<iceberg::struct_value> build_rp_struct(
  model::partition_id pid,
  kafka::offset o,
  std::optional<iobuf> key,
  model::timestamp ts,
  model::timestamp_type ts_t,
  const chunked_vector<model::record_header>& headers) {
    auto system_data = std::make_unique<iceberg::struct_value>();
    system_data->fields.reserve(6);

    system_data->fields.emplace_back(iceberg::int_value(pid));
    system_data->fields.emplace_back(iceberg::long_value(o));
    // NOTE: Kafka uses milliseconds, Iceberg uses microseconds.
    system_data->fields.emplace_back(
      iceberg::timestamptz_value(ts.value() * 1000));
    ... brittle, carefully ordered code ...

...becomes:

std::unique_ptr<iceberg::struct_value> build_rp_struct(
  model::partition_id pid,
  kafka::offset o,
  std::optional<iobuf> key,
  model::timestamp ts,
  model::timestamp_type ts_t,
  const chunked_vector<model::record_header>& headers) {
    return rp_desc::build_value(
      val<"partition">(iceberg::int_value(pid)),
      val<"offset">(iceberg::long_value(o)),
      val<"timestamp">(iceberg::timestamptz_value(ts.value() * 1000)),

and the definitions of compile-time schemas becomes more crisply defined
as well.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • None

Copilot AI review requested due to automatic review settings April 15, 2026 07:01
@andrwng andrwng force-pushed the compile-time-iceberg-structs branch from 4139da5 to 74d8b19 Compare April 15, 2026 07:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a compile-time schema descriptor system for datalake table definitions, replacing hand-built runtime schema construction and brittle positional access for the embedded redpanda system struct.

Changes:

  • Added schema_descriptor.h providing struct_desc/field_desc descriptors, compile-time index_of, and typed type_field/value_field accessors.
  • Updated table_definition.* to define the schemaless schema via descriptors and to centralize construction of the redpanda struct_value.
  • Updated record_translator.cc to use typed accessors (rp_struct_type / rp_struct_value) instead of fields[0] indexing.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/v/datalake/table_definition.h Defines compile-time schema descriptors and adds typed accessors + build_rp_struct declaration.
src/v/datalake/table_definition.cc Builds runtime schema via descriptors and implements centralized build_rp_struct.
src/v/datalake/schema_descriptor.h New compile-time descriptor framework and typed field access helpers.
src/v/datalake/record_translator.cc Replaces brittle positional access with typed accessors for the redpanda nested struct.
src/v/datalake/BUILD Wires the new header and its dependencies into the build.

Comment thread src/v/datalake/schema_descriptor.h Outdated
Comment thread src/v/datalake/schema_descriptor.h
@andrwng andrwng force-pushed the compile-time-iceberg-structs branch 5 times, most recently from 06180af to e4024cb Compare April 15, 2026 07:55
Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really cool. The structural ct_string is a nice touch. I need to review with fresh eyes, but my intuition is that if it builds and tests pass it's almost certainly correct?

@andrwng
Copy link
Copy Markdown
Contributor Author

andrwng commented Apr 16, 2026

Really cool. The structural ct_string is a nice touch. I need to review with fresh eyes, but my intuition is that if it builds and tests pass it's almost certainly correct?

Yeah that's my intuition too.

bharathv
bharathv previously approved these changes Apr 21, 2026
Copy link
Copy Markdown
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nitpick but looks good to me, I guess if it compiles, its ok?

Comment thread src/v/datalake/schema_descriptor.h Outdated
Comment on lines +154 to +168
template<typename T>
static constexpr bool is_nested_v = false;
template<typename... Fs>
static constexpr bool is_nested_v<struct_desc<Fs...>> = true;
template<typename E>
static constexpr bool is_nested_v<list_desc<E>> = true;

template<typename F>
static auto build_iceberg_type(int& next_id) {
if constexpr (is_nested_v<typename F::type>) {
return F::type::build_impl(next_id);
} else {
return typename F::type{};
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this could be a concept, a new (future) type without a is_nested_v specialization defaults to the else branch, maybe a bit of surprise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

oleiman
oleiman previously approved these changes Apr 21, 2026
Copy link
Copy Markdown
Member

@oleiman oleiman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for "if it compiles it's good"

Replace the hand-rolled schemaless_struct_type() with a compile-time
schema descriptor system. The schema is defined as a type:

```
  using schemaless_desc = struct_desc<
    field_desc<"redpanda", rp_desc>>;
```

From this single definition:
- schemaless_struct_type() generates the runtime iceberg::struct_type
- index_of<"field_name"> provides compile-time field indices
- type_field/value_field accessors replace brittle fields[0]
- rp_struct_type()/rp_struct_value() provide named access

So code like this:

```
std::unique_ptr<iceberg::struct_value> build_rp_struct(
  model::partition_id pid,
  kafka::offset o,
  std::optional<iobuf> key,
  model::timestamp ts,
  model::timestamp_type ts_t,
  const chunked_vector<model::record_header>& headers) {
    auto system_data = std::make_unique<iceberg::struct_value>();
    system_data->fields.reserve(6);

    system_data->fields.emplace_back(iceberg::int_value(pid));
    system_data->fields.emplace_back(iceberg::long_value(o));
    // NOTE: Kafka uses milliseconds, Iceberg uses microseconds.
    system_data->fields.emplace_back(
      iceberg::timestamptz_value(ts.value() * 1000));
    ... brittle, carefully ordered code ...
```

...becomes:

```
std::unique_ptr<iceberg::struct_value> build_rp_struct(
  model::partition_id pid,
  kafka::offset o,
  std::optional<iobuf> key,
  model::timestamp ts,
  model::timestamp_type ts_t,
  const chunked_vector<model::record_header>& headers) {
    return rp_desc::build_value(
      val<"partition">(iceberg::int_value(pid)),
      val<"offset">(iceberg::long_value(o)),
      val<"timestamp">(iceberg::timestamptz_value(ts.value() * 1000)),
```

and the definitions of compile-time schemas becomes more crisply defined
as well.
@andrwng andrwng dismissed stale reviews from oleiman and bharathv via 9552c11 April 21, 2026 04:15
@andrwng andrwng force-pushed the compile-time-iceberg-structs branch from e4024cb to 9552c11 Compare April 21, 2026 04:15
@andrwng andrwng requested review from bharathv and oleiman April 21, 2026 04:15
@andrwng andrwng enabled auto-merge April 21, 2026 06:15
@andrwng andrwng merged commit 31bed1f into redpanda-data:dev Apr 21, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants